Term Generalization and Synonym Resolution for Biological Abstracts: Using the Gene Ontology for Subcellular Localization Prediction

Authors

  • Greg Kondrak
  • Colin Cherry
  • Shane Bergsma
  • Paul Lu
  • Russell Greiner
  • Kurt McMillan
Abstract

The field of molecular biology is growing at an astounding rate, and research findings are being deposited into public databases such as Swiss-Prot. Many of the over 200,000 protein entries in Swiss-Prot 49.1 lack annotations such as subcellular localization or function, but the vast majority have references to journal abstracts describing related research. These abstracts represent a huge amount of information that could be used to generate annotations for proteins automatically. Training classifiers to perform text categorization on abstracts is one way to accomplish this task. We present a method for improving text classification for biological journal abstracts by generating additional text features using the knowledge represented in a biological concept hierarchy (the Gene Ontology). Both the structure of the ontology and the synonyms recorded in it are leveraged using a simple technique that significantly improves the F-measure of some subcellular localization text classifiers. The usefulness of two importance weight measures (redundancy and TFIDF) is evaluated in conjunction with the additional features generated by the Gene Ontology hierarchy.
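The general idea of synonym resolution and term generalization can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny hierarchy, the synonym table, the function names, and the choice to add ancestors as plain features are all assumptions made for illustration, and the TFIDF variant shown (raw count times log of inverse document frequency) is only one common formulation. The redundancy weight mentioned in the abstract is not sketched because its definition is not given here. Multi-word terms are assumed to have been matched as single tokens beforehand.

```python
import math
from collections import Counter

# Hypothetical toy fragment of a concept hierarchy (child -> parents).
# Illustrative only; not taken from the paper or checked against the Gene Ontology.
PARENTS = {
    "nucleolus": ["nucleus"],
    "nucleus": ["intracellular organelle"],
    "mitochondrion": ["intracellular organelle"],
    "intracellular organelle": [],
}

# Hypothetical synonym table: surface form -> canonical term.
SYNONYMS = {
    "mitochondria": "mitochondrion",
    "cell nucleus": "nucleus",
}

def ancestors(term):
    """Return every ancestor of `term` in PARENTS (transitive closure)."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(PARENTS.get(node, []))
    return seen

def expand(tokens):
    """Resolve synonyms to canonical terms, then append ancestor terms as extra features."""
    features = []
    for tok in tokens:
        canonical = SYNONYMS.get(tok, tok)
        features.append(canonical)
        if canonical in PARENTS:
            features.extend(sorted(ancestors(canonical)))
    return features

def tfidf(docs):
    """Weight each feature by raw count times log(N / document frequency)."""
    n = len(docs)
    df = Counter(t for d in docs for t in set(d))
    return [{t: c * math.log(n / df[t]) for t, c in Counter(d).items()} for d in docs]

# Three toy "abstracts", already tokenized.
docs = [
    ["protein", "mitochondria"],
    ["protein", "nucleolus"],
    ["protein", "cell nucleus"],
]
expanded = [expand(d) for d in docs]
weights = tfidf(expanded)
```

In this toy run, the second and third documents end up sharing the feature `nucleus` even though only one of them mentions it directly, which is the effect term generalization is meant to produce. Features that occur in every document, such as `protein`, receive a weight of zero under this TFIDF variant. A variant that prefixes ancestor features to keep them distinct from literal mentions would be an equally reasonable design.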

Similar resources

Improving subcellular localization prediction using text classification and the gene ontology

MOTIVATION Each protein performs its functions within some specific locations in a cell. This subcellular location is important for understanding protein function and for facilitating its purification. There are now many computational techniques for predicting location based on sequence analysis and database information from homologs. A few recent techniques use text from biological abstracts: ...

MultiLoc2 and SherLoc2: improved prediction of subcellular protein localization

The function of a protein is highly correlated with its subcellular localization. However, determining the subcellular localization of a protein experimentally can be difficult and time-consuming. Computational methods for the prediction of subcellular locations of proteins from the sequence alone are an attractive alternative. MultiLoc2 [1] and SherLoc2 [3] both significantly extend and improv...

Molecular Characterization of the Epstein-Barr Virus BGLF2 Gene, its Expression, and Subcellular Localization

Background: Epstein–Barr virus (EBV) is a universal herpesvirus that can cause a lifelong and largely asymptomatic infection in the human population. However, the exact pathogenesis of EBV infection is not well known. Objective: A comprehensive bioinformatics prediction was carried out to investigate the molecular properties of BGLF2 and to a...

Prediction of Protein Sub-Mitochondria Locations Using Protein Interaction Networks

Background: Prediction of protein localization is among the most important problems in bioinformatics; it is used to predict where proteins reside within cells and organelles such as mitochondria. In this study, several machine learning algorithms are applied to the prediction of intracellular protein locations. These algorithms use the features extracted from pro...

Publication date: 2006